MAIR/APAN Workshop 2024

Last week’s election saw surprising swings away from the Democrats at the Presidential level, including among many blue stronghold states. It is useful to place these political movements within the broader historical context of US Presidential elections. Today, we will use R to collect, explore, and communicate data on past US Presidential Elections. We will use the stories that emerge from trends in these data to develop our understanding of the most recent election and those to come.

  1. We will start by collecting the relevant data using an API.

  2. We will then clean up and explore these data, readying ourselves for some analysis.

  3. We will explore various ways of visualizing these data, including creating interactive plots and maps.

  4. Finally, we will explore tools in R to communicate our findings.

Set up

To follow along with this workshop, you will need:

  1. The latest version of R and RStudio

  2. A new RProject

  3. This workshop.qmd file (which you can download from https://github.com/hgoers/mair_apan_workshop_2024)

  4. The following R packages installed on to your local device:

install.packages(c("tidyverse",
                   "devtools",
                   "dataverse",
                   "here",
                   "glue",
                   "ggtext",
                   "skimr",
                   "MetBrewer",
                   "scales",
                   "maps",
                   "mapdata",
                   "leaflet",
                   "sf",
                   "htmltools",
                   "patchwork",
                   "ggh4x",
                   "geofacet"))
Tip

If you hover over a code chunk, you should see a little copy icon appear in the top right hand corner. Clicking on that copies the code in the code chunk. You can then easily paste this in your own script.

Collecting your data

We will start by collecting data on historical US Presidential state-level returns. The MIT Election Data + Science Lab provides these returns to the public. To access their data, we will use the Harvard Dataverse API. We will access it using the R package dataverse.

Note

The dataverse package accesses the Harvard Dataverse Application Programming Interface (API) in the background. To learn more about how to access this API directly, check out the Harvard Dataverse documentation.

library(tidyverse)
library(dataverse)

We will be accessing the U.S. President 1976–2020 data set. To do this, we need three pieces of information:

  1. The name of the file we want to download

  2. The data set’s DOI

  3. The data set’s format

We can get all of this information from the data set’s page on the Harvard Dataverse website.

state_results_df <- get_dataframe_by_name(
  filename = "1976-2020-president.tab",
  dataset = "10.7910/DVN/42MVDX",
  server = "dataverse.harvard.edu",
  original = T,
  .f = readr::read_csv
)

This code programmatically pulls the most up-to-date data set from the API. Once the MIT Election Lab publishes the 2024 Presidential Election data, you will only need to update the file name to be able to access it.

Saving your data

You should always save a copy of your data locally, particularly when those data are updated regularly. The results of your analysis may change with updates to the raw data. You want to be able to track those changes (and uncover why they occurred).

The here R package makes saving data in a robust way very easy. Using here::here(), you can automatically update file paths to reflect the computer you are currently using. To demonstrate, run the following code:

here::here()
[1] "/Users/harrietgoers/Documents/mair_apan_workshop_2024"

Your output will be different to mine. here::here() locates the top level of the file directory in which you are located. This is helpful when you share your code with others (or switch to a new computer). Hard coded file paths will cause annoying errors.

I like to save my raw data in a folder called data-raw. Go ahead a make a similar folder in your RProject.

To point here::here() to this folder, you simply supply it with the sub-folder name(s) as a string:

here::here("data-raw")
[1] "/Users/harrietgoers/Documents/mair_apan_workshop_2024/data-raw"

I am going to save a copy of the latest US Presidential Election results. I am going to include in the file name the date that I pulled the data. This stops me from overwriting previous versions of the data. I need those previous versions if I want to track changes made.

Tip

I use glue::glue() to evaluate my R code (enclosed in those curly brackets) and convert the output into a string.

I use lubridate::today() (loaded in with the library(tidyverse) command) to get today’s date.

write_csv(state_results_df, 
          here::here("data-raw", glue::glue("pres_state_results_{today()}.csv")))

Reading in the latest version of your data

To make sure you are reading in the latest version of your data, you can run the following:

state_results_df <- list.files(path = here::here("data-raw"),
           pattern = "pres_state_results_.*",
           full.names = T) |> 
  max() |> 
  read_csv()

Exploring your data

We now have the latest published data on the number of votes each US Presidential candidate received in all elections from 1976 to 2020. Let’s take a good look at it.

First, I want to get a quick overview of my data. I like to use skimr::skim() to do this:

skimr::skim(state_results_df)
Data summary
Name state_results_df
Number of rows 4287
Number of columns 15
_______________________
Column type frequency:
character 6
logical 2
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 4 20 0 51 0
state_po 0 1.00 2 2 0 51 0
office 0 1.00 12 12 0 1 0
candidate 287 0.93 4 38 0 269 0
party_detailed 456 0.89 3 40 0 172 0
party_simplified 0 1.00 5 11 0 4 0

Variable type: logical

skim_variable n_missing complete_rate mean count
writein 3 1 0.11 FAL: 3807, TRU: 477
notes 4287 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1999.08 14.22 1976 1988 2000 2012.0 2020 ▇▅▅▅▇
state_fips 0 1 28.62 15.62 1 16 28 41.0 56 ▇▇▇▇▇
state_cen 0 1 53.67 26.03 11 33 53 81.0 95 ▆▆▇▃▇
state_ic 0 1 39.75 22.77 1 22 42 61.0 82 ▇▆▇▇▅
candidatevotes 0 1 311907.59 764801.10 0 1177 7499 199241.5 11110250 ▇▁▁▁▁
totalvotes 0 1 2366924.15 2465008.31 123574 652274 1569180 3033118.0 17500881 ▇▂▁▁▁
version 0 1 20210113.00 0.00 20210113 20210113 20210113 20210113.0 20210113 ▁▁▇▁▁

Checking your data set is complete

Let’s make sure that we have all years covered. US Presidential elections occur every four years, so we should have 12 unique years in our data set. Let’s confirm that:

state_results_df |> 
  distinct(year)
# A tibble: 12 × 1
    year
   <dbl>
 1  1976
 2  1980
 3  1984
 4  1988
 5  1992
 6  1996
 7  2000
 8  2004
 9  2008
10  2012
11  2016
12  2020

Great! I wonder if the number of presidential candidates has changed over time. Let’s see how many people ran each year:

n_candidates_annual <- state_results_df |> 
  distinct(year, candidate) |> 
  drop_na() |> 
  count(year)

n_candidates_annual
# A tibble: 12 × 2
    year     n
   <dbl> <int>
 1  1976    15
 2  1980    21
 3  1984    17
 4  1988    18
 5  1992    25
 6  1996    24
 7  2000    20
 8  2004    23
 9  2008    27
10  2012    35
11  2016    32
12  2020   102

This is a bit unwieldy. Let’s visualize it to get a better sense:

ggplot(n_candidates_annual, aes(x = year, y = n)) + 
  geom_line(linewidth = 1) + 
  geom_point() + 
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold"),
        plot.title.position = "plot") + 
  labs(title = glue::glue("Number of people running for US President, {min_year} - {max_year}"),
       x = "Year",
       y = NULL,
       caption = "Source: MIT Election Lab")

Hmm, that’s a shocking spike in 2020. What happened? Let’s take a closer look at the data.

First, let’s confirm that these are indeed individual candidates:

state_results_df |> 
  filter(year == 2020) |> 
  distinct(candidate) |> 
  drop_na()
# A tibble: 102 × 1
   candidate                          
   <chr>                              
 1 "BIDEN, JOSEPH R. JR"              
 2 "TRUMP, DONALD J."                 
 3 "JORGENSEN, JO"                    
 4 "DE LA FUENTE, ROQUE \"\"ROCKY\"\""
 5 "PIERCE, BROCK"                    
 6 "JANOS, JAMES G. \"JESSE VENTURA\""
 7 "BLANKENSHIP, DON"                 
 8 "LA RIVA, GLORIA ESTELLA"          
 9 "CUMMINGS, DANIEL CLYDE"           
10 "SIMMONS, JADE"                    
# ℹ 92 more rows

Let’s then check whether one party is driving this:

party_count_annual <- state_results_df |> 
  distinct(year, party_simplified, candidate) |> 
  drop_na(candidate) |> 
  count(year, party_simplified)
ggplot(party_count_annual, aes(x = year, y = n, colour = party_simplified)) + 
  geom_line() + 
  geom_point() + 
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold"),
        plot.title.position = "plot",
        legend.position = "top") + 
  labs(title = glue::glue("Number of people running for US President, {min_year} - {max_year}"),
       x = "Year",
       y = NULL,
       colour = "Party",
       caption = "Source: MIT Election Lab") + 
  scale_color_manual(values = MetBrewer::met.brewer("Homer1"))

It does appear that the number of “Other” candidates is unusually large (and has been driving the general trend of increasing numbers of candidates over the years). The other large party groups - the Democrats, Republicans, and Libertarians - appear to have fielded their usual number of candidates.

Could this be driven by a large number of write-in candidates?

write_in_count_annual <- state_results_df |> 
  drop_na(candidate, writein) |> 
  distinct(year, writein, candidate) |> 
  count(year, writein)

ggplot(write_in_count_annual, aes(x = year, y = n, colour = writein)) + 
  geom_line() + 
  geom_point() + 
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold"),
        plot.title.position = "plot",
        legend.position = "top") + 
  labs(title = glue::glue("Number of people running for US President, {min_year} - {max_year}"),
       x = "Year",
       y = NULL,
       colour = "Party",
       caption = "Source: MIT Election Lab") + 
  scale_color_manual(values = MetBrewer::met.brewer("Homer1"))

Partially, yes. There is a huge spike in the number of candidates written in to the ballot. There also appears to be an increase in the number of candidates on the ballot; however, this appears to be following a general trend across time.

Note

Note that there are some years in which no write-in candidates emerged (1976 to 1984, 1992, 1996, and 2012). The way we calculated the number of write-in and non-write-in candidates above leaves these years as NAs, not zero counts. To fix this, do the following:

# Create a scope data frame
scope_df <- state_results_df |> 
  distinct(year) |> 
  crossing(writein = c(T, F))

# Append your calculations to that scope
write_in_count_annual_full <- scope_df |> 
  left_join(write_in_count_annual, by = join_by(year, writein)) |> 
  mutate(n = replace_na(n, 0))

ggplot(write_in_count_annual_full, aes(x = year, y = n, colour = writein)) + 
  geom_line() + 
  geom_point() + 
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold"),
        plot.title.position = "plot",
        legend.position = "top") + 
  labs(title = glue::glue("Number of people running for US President, {min_year} - {max_year}"),
       x = "Year",
       y = NULL,
       colour = "Party",
       caption = "Source: MIT Election Lab") + 
  scale_color_manual(values = MetBrewer::met.brewer("Homer1"))

Could this be explained by one (or a small group of states)?

state_count_annual <- state_results_df |> 
  distinct(year, state_po, candidate) |> 
  drop_na(candidate) |> 
  count(year, state_po)
ggplot(state_count_annual, aes(x = year, y = n)) + 
  geom_line() + 
  geom_point() + 
  facet_wrap(~ state_po) + 
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold"),
        plot.title.position = "plot") + 
  labs(title = glue::glue("Number of people running for US President, {min_year} - {max_year}"),
       x = "Year",
       y = NULL,
       colour = "Party",
       caption = "Source: MIT Election Lab") + 
  scale_x_continuous(breaks = seq(min_year, max_year, 20))

Question

Can anyone see any commonalities between the states in which there were large spikes in the number of candidates running?

Focusing on the major parties

We want to place this year’s Presidential election within its historical context. We have explored our data to get a general understanding of those trends, but we now want to formalize them.

To begin, let’s focus our data:

pres_state_results_df <- state_results_df |> 
  filter(
    # Isolate DEM and REP from other party candidates
    party_simplified %in% c("DEMOCRAT", "REPUBLICAN"),
    # Remove other candidates
    !candidate %in% c("OTHER", NA_character_) 
  ) |>
  # Get the total count of votes cast for other candidates, preserving
  # the total vote count for the state and the candidate
  group_by(year, state, state_po, party_simplified, candidate, totalvotes) |> 
  summarise(candidatevotes = sum(candidatevotes)) |>
  # Always remember to ungroup your data
  ungroup() |> 
  # Calculate the proportion of votes cast in each state won by each 
  # candidate
  mutate(prop_votes = candidatevotes / totalvotes)

pres_state_results_df
# A tibble: 1,224 × 8
    year state     state_po party_simplified candidate totalvotes candidatevotes
   <dbl> <chr>     <chr>    <chr>            <chr>          <dbl>          <dbl>
 1  1976 ALABAMA   AL       DEMOCRAT         CARTER, …    1182850         659170
 2  1976 ALABAMA   AL       REPUBLICAN       FORD, GE…    1182850         504070
 3  1976 ALASKA    AK       DEMOCRAT         CARTER, …     123574          44058
 4  1976 ALASKA    AK       REPUBLICAN       FORD, GE…     123574          71555
 5  1976 ARIZONA   AZ       DEMOCRAT         CARTER, …     742719         295602
 6  1976 ARIZONA   AZ       REPUBLICAN       FORD, GE…     742719         418642
 7  1976 ARKANSAS  AR       DEMOCRAT         CARTER, …     767535         498604
 8  1976 ARKANSAS  AR       REPUBLICAN       FORD, GE…     767535         267903
 9  1976 CALIFORN… CA       DEMOCRAT         CARTER, …    7803770        3742284
10  1976 CALIFORN… CA       REPUBLICAN       FORD, GE…    7803770        3882244
# ℹ 1,214 more rows
# ℹ 1 more variable: prop_votes <dbl>

Quickly check to make sure everything looks correct. Are there any outliers in votes cast?

ggplot(pres_state_results_df, aes(x = totalvotes)) + 
  geom_boxplot()

Our data are skewed. This is likely because there are large differences in states’ populations, with a few states having very large populations. Let’s check that:

pres_state_results_df |>
  distinct(year, state, totalvotes) |> 
  # Account for changes to population counts over the decades
  group_by(year) |> 
  # Cut the data into five groups based on vote 
  mutate(vote_group = cut_interval(totalvotes, 5, 
                                   labels = c("Rare",
                                              "Medium rare",
                                              "Medium",
                                              "Well done",
                                              "Burnt"))) |> 
  ungroup() |> 
  filter(vote_group %in% c("Well done", "Burnt")) |> 
  distinct(state)
# A tibble: 4 × 1
  state     
  <chr>     
1 CALIFORNIA
2 NEW YORK  
3 FLORIDA   
4 TEXAS     

These are the states with the four largest populations in the country. Those “outliers” are just an artifact of the distribution of people across America.

Question

Can you think of anything else we should check in these data?

Tracking proportion of votes won by state and over time

Now that we have cleaned and checked our data, we can answer our initial research question: What proportion of votes did the Democratic and Republican candidates win in each state in each year?

Let’s play around with the 2020 Presidential election results:

pres_state_results_2020 <- pres_state_results_df |> 
  filter(year == 2020)

Let’s look at some different ways to plot the results by state. First, let’s plot the proportion of votes won by each candidate in each state in a way that makes the comparison easy to see.

pres_state_results_2020 |> 
  mutate(prop_votes = if_else(party_simplified == "REPUBLICAN", -prop_votes, prop_votes)) |> 
  ggplot(aes(x = prop_votes, y = reorder(state, prop_votes), 
             fill = party_simplified)) + 
  geom_col() + 
  geom_vline(xintercept = c(-0.5, 0.5), colour = "grey") + 
  theme_minimal() + 
  theme(legend.position = "none",
        plot.title = ggtext::element_markdown(face = "bold"),
        plot.title.position = "plot") + 
  labs(title = "Percentage of votes won by <span style='color:#4672AB;'>Joe Biden</span> and <span style='color:#AA4841;'>Donald Trump</span> in the 2020 US <br>Presidential Election",
       x = "Percentage of votes won",
       y = NULL) + 
  scale_fill_manual(values = c("#4672AB", "#AA4841")) + 
  scale_x_continuous(breaks = c(-1, -0.5, 0.5, 1),
                     limits = c(-1, 1),
                     labels = c("-1" = "100%",
                                "-0.5" = "50%",
                                "0.5" = "50%",
                                "1" = "100%"))

Next, we can visualize the difference in the proportion of votes won by each candidate:

prop_won_2020 <- pres_state_results_2020 |> 
  # Make sure the order of the parties is consistent
  arrange(state, party_simplified) |> 
  group_by(state) |> 
  mutate(prop_adv = prop_votes - lead(prop_votes)) |>
  ungroup() |> 
  drop_na(prop_adv)
  
ggplot(prop_won_2020, aes(x = prop_adv, y = reorder(state, prop_adv), 
             fill = prop_adv < 0)) + 
  geom_col() + 
  theme_minimal() + 
  theme(legend.position = "none",
        plot.title = ggtext::element_markdown(face = "bold"),
        plot.title.position = "plot") + 
  labs(title = "Advantage won by <span style='color:#4672AB;'>Joe Biden</span> and <span style='color:#AA4841;'>Donald Trump</span> in the 2020 US Presidential <br>Election",
       x = "Percentage of votes won",
       y = NULL) + 
  scale_fill_manual(values = c("#4672AB", "#AA4841")) + 
  scale_x_continuous(labels = scales::percent,
                     limits = c(-1, 1))

Finally, we can plot this on a map:

library(maps)
library(mapdata)

us_states <- map_data("state")

prop_won_sf <- us_states |> 
  left_join(mutate(prop_won_2020, region = str_to_lower(state)),
            by = join_by(region))

ggplot(prop_won_sf, aes(x = long, y = lat, fill = prop_adv, group = group)) + 
  geom_polygon(color = "lightgrey") + 
  guides(fill = FALSE) + 
  theme_void() + 
  coord_fixed(1.3) + 
  scale_fill_gradient2(low = "#AA4841",
                       high = "#4672AB",
                       midpoint = 0)

You can even make this interactive:

library(leaflet)
library(sf)
library(htmltools)

us_states <- read_sf("https://rstudio.github.io/leaflet/json/us-states.geojson")

prop_won_sf <- us_states |>
  mutate(state = str_to_upper(name)) |> 
  left_join(prop_won_2020, by = join_by(state))

mypalette <- colorNumeric(
  palette = c("#AA4841", "white", "#4672AB"),
  domain = c(-1, 1),
  na.color = "transparent"
)

labels <- glue::glue("{prop_won_sf$state}: {scales::percent(prop_won_sf$prop_adv, accuracy = 0.1)}")

leaflet(prop_won_sf) |> 
  setView(-96, 37.8, 4) |> 
  addPolygons(fillColor = ~ mypalette(prop_adv),
              weight = 2,
              color = "white",
              fillOpacity = 1,
              highlightOptions = highlightOptions(weight = 4,
                                                  color = "darkgrey",
                                                  fillOpacity = 0.7,
                                                  bringToFront = T),
              label = labels,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto"))
Caution

Everyone loves a map, but you should use them cautiously. They can be misleading. Large states (by area) are overemphasized. Small states (and DC) can be very difficult to see. Because the importance of a state’s votes to the outcome of the Presidential election is not proportional to its geographic size, this can obscure the outcome you are trying to communicate.